A comprehensive software suite for protein family construction and functional site prediction
Abstract
In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Background Assertion Labeling) is a data-mining procedure for finding such signature regions. It begins by using clues from genomic context, such as co-occurrence or conserved gene neighborhoods, to build a useful training set from a large number of uncharacterized but mutually homologous proteins. When training set construction is successful, the YES partition is enriched in proteins that share function with the user's query sequence, while the NO partition is depleted in them. A selected query sequence is then mined for short signature regions whose closest matches overwhelmingly favor proteins from the YES partition. High-scoring signature regions typically contain key residues critical to functional specificity, so proteins with the highest sequence similarity across these regions tend to share the same function. The SIMBAL algorithm was described previously, but significant manual effort, expertise, and a supporting software infrastructure were required to prepare the requisite training sets. Here, we describe a new, distributable software suite that speeds up and simplifies the use of SIMBAL, most notably by providing tools that automate training set construction. These tools have broad utility for comparative genomics, allowing for flexible collection of proteins or protein domains based on genomic context as well as homology, a capability that can greatly assist in protein family construction. Armed with this new software suite, SIMBAL can serve as a fast and powerful in silico alternative to direct experimentation for characterizing proteins and their functional interactions.
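The window-mining step described in the abstract can be illustrated with a small sketch. This is not the SIMBAL suite's code: it assumes a toy ungapped identity count in place of a real sequence search tool, and it assumes a binomial tail probability as the measure of how strongly a window's closest matches favor the YES partition. All function names and parameters (`window_similarity`, `binomial_tail`, `score_windows`, `width`, `top_n`) are hypothetical.

```python
from math import comb


def window_similarity(window: str, target: str) -> int:
    """Best ungapped identity count of `window` against any same-length
    stretch of `target` (toy stand-in for a real similarity search)."""
    w = len(window)
    best = 0
    for j in range(len(target) - w + 1):
        matches = sum(1 for x, y in zip(window, target[j:j + w]) if x == y)
        best = max(best, matches)
    return best


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def score_windows(query, yes, no, width, top_n):
    """Slide a window along `query`; for each window, take the `top_n`
    closest training sequences and ask how surprising the number of YES
    members is, given the overall YES fraction of the training set.

    Returns a list of (start, yes_hits, tail_probability) tuples.
    A smaller tail probability means stronger enrichment for YES.
    """
    labeled = [(s, True) for s in yes] + [(s, False) for s in no]
    p_yes = len(yes) / len(labeled)
    results = []
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Sort by descending similarity; on ties NO sorts before YES
        # (False < True), which is the conservative choice (assumption).
        ranked = sorted(
            labeled,
            key=lambda item: (-window_similarity(window, item[0]), item[1]),
        )
        top = ranked[:top_n]
        yes_hits = sum(1 for _, is_yes in top if is_yes)
        results.append((start, yes_hits, binomial_tail(yes_hits, len(top), p_yes)))
    return results


if __name__ == "__main__":
    # Invented toy sequences: the YES set shares the motif "WCHW" with the query.
    query = "GGGGWCHWGGGG"
    yes = ["TTTTWCHWTTTT", "PPPPWCHWPPPP"]
    no = ["TTTTAAAATTTT", "PPPPAAAAPPPP"]
    for start, hits, prob in score_windows(query, yes, no, width=4, top_n=2):
        print(start, hits, prob)
```

In this toy input, windows that overlap the shared motif draw their closest matches from the YES set, while windows with no informative matches fall back to the conservative tie-break and show no enrichment. A realistic run would need a proper alignment-based search and far larger partitions for the probabilities to be meaningful.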
Similar resources
In silico investigation of lactoferrin protein characterizations for the prediction of anti-microbial properties
Lactoferrin (Lf) is an iron-binding multi-functional glycoprotein which has numerous physiological functions such as iron transportation, anti-microbial activity and immune response. In this study, different in silico approaches were exploited to investigate Lf protein properties in a number of mammalian species. Results showed that the iron-binding site, DNA and RNA-binding sites, signal pepti...
Designing, Optimization and Construction of Myelin Basic Protein Coding Sequence Binding to the Immunogenic Subunit of Cholera Toxin
Abstract Background and Objectives: Multiple sclerosis (MS) is a chronic inflammatory autoimmune disease. Mucosal feeding of myelin basic protein binding to the cholera toxin B subunit can reduce the intensity of the immune response in MS patients. Expression system, the domain composition of the fusion protein, accessibility of two domains, codon adaptation index (CAI) and GC contents are v...
An update of DIVERGE software for functional divergence analysis of protein family.
DIVERGE is a software system for phylogeny-based analyses of protein family evolution and functional divergence. It provides a suite of statistical tools for selection and prioritization of the amino acid sites that are responsible for the functional divergence of a gene family. The synergistic efforts of DIVERGE and other methods have convincingly demonstrated that the pattern of rate change a...
Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches
DNA sequence, containing all genetic traits is not a functional entity. Instead, it transfers to protein sequences by transcription and translation processes. This protein sequence takes on a 3D structure later, which is a functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can figure is undertaken by proteins with specific functio...
Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases
MOTIVATION The prediction of transcription units (TUs, which are similar to operons) is an important problem that has been tackled using many different approaches. The availability of complete microbial genomes has made genome-wide TU predictions possible. Pathway-genome databases (PGDBs) add metabolic and other organizational (i.e. protein complexes) information to the annotated genome, and ar...